Skip to content

[AISOS-741] Add auto review primitive to Forge#119

Open
JoshSalomon wants to merge 27 commits into
forge-sdlc:mainfrom
JoshSalomon:forge/aisos-741
Open

[AISOS-741] Add auto review primitive to Forge#119
JoshSalomon wants to merge 27 commits into
forge-sdlc:mainfrom
JoshSalomon:forge/aisos-741

Conversation

@JoshSalomon

Copy link
Copy Markdown
Contributor

Summary

This PR implements the Auto-Review Primitive, a new capability that enables automated quality gates for skill execution in Forge. When a skill includes a review.md file, the system automatically spawns a reviewer agent after execution to validate output quality, providing structured feedback and retry loops until the work is approved or max retries are exhausted. This feature reduces manual review burden and catches quality issues earlier in the development cycle.

Changes

Container Review Loop Engine (containers/review.py)

  • Created Verdict enum (APPROVED/REJECTED) and ReviewConfig/ReviewCycleData dataclasses
  • Implemented parse_review_config() for parsing YAML frontmatter from review.md files
  • Implemented detect_review_md() with project override precedence (mirrors skill resolver pattern)
  • Implemented parse_verdict() for case-insensitive APPROVED/REJECTED marker extraction
  • Implemented write_cycle_file() for JSON output to .forge/{step-name}/review_cycle_N.json

Container Entrypoint Integration (containers/entrypoint.py)

  • Added run_review_loop() orchestrating the reviewer agent spawn, verdict parsing, and retry logic
  • Added run_reviewer_agent() and run_worker_with_feedback() for feedback injection
  • Added load_conversation_history() for loading prior context on retries
  • Integrated terminal progress display with TTY detection (_print_review_progress())

Observability Infrastructure (src/forge/observability/)

  • Created ReviewCyclePoller class with async polling loop for detecting review cycle files during container execution
  • Created ReviewCycleRecorder with mode switching (log/copy/None) for recording detected cycles
  • Created ReviewJiraNotifier for posting formatted Jira comments with rate limiting
  • Added Prometheus metrics: forge_review_cycles_total, forge_review_verdicts_total, forge_review_duration_seconds

Container Runner Integration (src/forge/sandbox/runner.py)

  • Extended ContainerResult dataclass with review_cycles field
  • Added step_name parameter to ContainerRunner.run() method
  • Implemented _poll_review_cycles() background task for async polling during execution
  • Implemented _sweep_review_cycles() for post-execution sync sweep to catch missed files

Workflow Node Updates

  • Updated all ContainerRunner-invoking nodes to pass step_name parameter:
    • implementation.py: step_name='implement_task'
    • local_reviewer.py: step_name='local_review'
    • ci_evaluator.py: step_name='analyze_ci'/'fix_ci'
    • code_review.py: step_name='code_review'
    • docs_updater.py: step_name='update_docs'
    • implement_review.py: step_name='implement_review_analyze'/'implement_review_fix'
    • plan_bug_fix.py: step_name='plan_bug_fix'
    • rca_analysis.py: step_name='analyze_bug'/'reflect_rca'
    • rebase.py: step_name='rebase'

Configuration (src/forge/config.py)

  • Added AUTO_REVIEW_MAX_RETRIES: default max retry attempts (default: 3)
  • Added AUTO_REVIEW_POLL_INTERVAL: polling interval in seconds (default: 5)
  • Added AUTO_REVIEW_RECORD_POLLED_FILES: recording mode ('log'/'copy'/None)

Sample Review Files

  • Created skills/default/implement-task/review.md with generic code review criteria
  • Created skills/default/local-code-review/review.md with pre-PR review criteria
  • Created skills/aisos/implement-task/review.md as project-specific override example

Documentation

  • Created docs/reference/review-md-schema.md with formal schema specification
  • Created docs/guide/auto-review.md with comprehensive user documentation
  • Updated docs/skills/authoring.md, docs/skills/index.md, docs/skills/defaults.md to reference auto-review
  • Updated docs/reference/config.md with new configuration settings

Dependencies

  • Added aiofiles>=24.1.0 to pyproject.toml for async file operations

Implementation Notes

  • Precedence Logic: detect_review_md() mirrors the skill resolver pattern—project-specific overrides (skills/{project}/{skill}/review.md) take precedence over defaults (skills/default/{skill}/review.md)
  • Verdict Parsing: Uses position-based precedence when both APPROVED and REJECTED markers are present (first occurrence wins)
  • Async + Sync Polling: The runner uses async polling during execution plus a synchronous post-execution sweep to ensure no files are missed on fast container exits
  • Rate Limiting: Jira notifier includes configurable rate limiting (default 30s) to prevent comment spam
  • TTY Detection: Terminal progress output only displays when stdout is a TTY to avoid polluting logs in CI environments
  • Graceful Degradation: Malformed YAML in review.md logs a warning and falls back to defaults (per BR-006)

Testing

  • Added 200+ unit tests across new modules:
    • tests/unit/containers/test_review.py: 57 tests for review.py functions
    • tests/unit/observability/test_review_poller.py: 32 tests for async polling
    • tests/unit/observability/test_review_recorder.py: 34 tests for recording modes
    • tests/unit/observability/test_review_notifier.py: 31 tests for Jira notification
    • tests/unit/sandbox/test_runner_review_polling.py: 25+ tests for runner integration
  • All 630+ existing unit tests continue to pass
  • Manual verification of terminal progress display format

Related Tickets


Generated by Forge SDLC Orchestrator

Forge added 21 commits July 2, 2026 12:37
… ReviewCycleData dataclasses

Detailed description:
- Created containers/review.py with:
  - Verdict StrEnum with APPROVED and REJECTED values
  - ReviewConfig dataclass with max_retries (int, default 3) and instructions (str)
  - ReviewCycleData dataclass with all specified fields
  - parse_review_config() function for parsing YAML frontmatter
- Function supports AUTO_REVIEW_MAX_RETRIES env var fallback (SC-005)
- Malformed YAML logs warning and returns defaults (SC-012/BR-006)
- Added 23 unit tests in tests/unit/containers/test_review.py

Closes: AISOS-2053
Implemented detect_review_md(skill_name, ticket_key, skills_base) -> Path | None
that locates review.md files with project override precedence:

- Mirrors precedence logic from src/forge/skills/resolver.py
- Checks project-specific path first: skills/{project}/{skill_name}/review.md
- Falls back to default: skills/default/{skill_name}/review.md
- Extracts project key from ticket_key (e.g., 'AISOS-123' -> 'aisos')
- Returns None if review.md doesn't exist (SC-010)
- Uses is_file() to correctly handle edge case of directory named review.md

Key files modified:
- containers/review.py: Added detect_review_md function with debug logging
- tests/unit/containers/test_review.py: Added 11 comprehensive unit tests
- tests/unit/containers/conftest.py: Created to isolate from forge imports

Supports:
- SC-006: Project override precedence
- SC-010: No reviewer when review.md absent

Closes: AISOS-2054
Added parse_verdict(output_text: str) -> tuple[Verdict, str] to containers/review.py:

- Performs case-insensitive search for APPROVED and REJECTED markers
- Returns (Verdict.APPROVED, "") when APPROVED marker found first (SC-002)
- Returns (Verdict.REJECTED, feedback) when REJECTED marker found first,
  extracting all text after the marker as feedback (SC-002)
- Returns (Verdict.REJECTED, "Verdict could not be parsed") when neither marker present
- When both markers present, first occurrence wins (position-based)
- Feedback text is stripped of leading/trailing whitespace
- Empty feedback handled correctly (SC-003 edge case)

Added 23 comprehensive unit tests covering:
- Case-insensitive matching (uppercase, lowercase, mixed case)
- Feedback extraction from text after REJECTED marker
- Edge cases: empty string, whitespace-only, neither marker
- Both markers present with position-based precedence
- Real-world review scenarios

All 57 tests pass.

Closes: AISOS-2055
Added write_cycle_file(workspace, step_name, cycle_data) to containers/review.py:
- Creates .forge/{step_name}/ directory if it doesn't exist
- Writes review_cycle_N.json where N is cycle_data.cycle
- Serializes ReviewCycleData to JSON with 2-space indentation
- Ensures verdict is lowercase string ('approved' or 'rejected')
- Handles both Verdict enum and string values for verdict field
- Uses ensure_ascii=False for proper Unicode handling

Added 18 unit tests in TestWriteCycleFile class verifying:
- Directory creation (.forge/ and step subdirectory)
- File output path pattern (review_cycle_N.json)
- All required JSON fields (SC-007)
- ISO 8601 UTC timestamp format preservation
- Verdict lowercase conversion and enum handling
- Multiple cycles written separately
- JSON pretty printing
- File overwrite behavior
- Empty/multiline/special character feedback handling
- Float precision for elapsed_seconds
- Step name variations (hyphens, underscores)

Closes: AISOS-2056
Integrates the review loop into the container entrypoint after skill execution:

- Import review functions from containers/review.py (detect_review_md,
  parse_review_config, parse_verdict, write_cycle_file, ReviewCycleData, Verdict)
- Add run_reviewer_agent() to spawn reviewer with review.md instructions (SC-001)
- Add load_conversation_history() to load from .forge/history/{task_key}.json
- Add run_worker_with_feedback() for retry with feedback injection (SC-003):
  - Injects '## Reviewer Feedback' section into task description
  - Loads conversation history for context
- Add run_review_loop() for complete review loop:
  - Parse config via parse_review_config (SC-004, SC-005)
  - Track cycle timing with time.perf_counter()
  - Parse verdict via parse_verdict (SC-002)
  - Write cycle file to .forge/{step_name}/review_cycle_N.json (SC-007)
  - Exit on APPROVED or max retries exhausted (BR-005)
- Modify main() to:
  - Extract skill_name from task_data or FORGE_SKILL_NAME env var
  - Check for review.md via detect_review_md after run_agent_task
  - Run review loop if review.md exists (SC-001, SC-010)

Add 17 unit tests covering all acceptance criteria:
- load_conversation_history: loads existing, handles missing, handles JSON error
- run_worker_with_feedback: feedback injection, history loading
- run_review_loop: APPROVED exits, REJECTED retries, max retries, frontmatter,
  env var fallback, cycle file path, reviewer instructions, timing, error handling
- main integration: skips when no skill_name, skips when no review.md, runs when exists

Closes: AISOS-2057
… mode

Add terminal progress output to the review loop in containers/entrypoint.py:

- Added TTY detection with _IS_TTY = sys.stdout.isatty()
- Added _print_review_progress() function for terminal output:
  - Format: "Review cycle N/M: VERDICT - [truncated feedback]"
  - Only outputs when stdout is a TTY (local/terminal mode)
  - Truncates feedback to 200 chars with ellipsis when exceeded
  - Uses flush=True for immediate output (SC-011: within 1 second)
- Integrated into run_review_loop() immediately after verdict determination
- Added 8 unit tests for the new functionality

Supports SC-011: terminal displays progress within 1 second of verdict.

Closes: AISOS-2058
Implements async polling for review cycle files during container execution.

Changes:
- Create ReviewCyclePoller class in src/forge/observability/review_poller.py
  - Accepts step_name parameter for step-specific path detection
  - Polls .forge/{step-name}/review_cycle_*.json files
  - Implements async polling loop with configurable interval
  - JSON parsing with retry handles race conditions with container writes
  - Tracks already-processed files to avoid duplicates
- Add ReviewCycleData dataclass for orchestrator-side review cycle data
- Add AUTO_REVIEW_POLL_INTERVAL (default 5s) to forge.config.Settings
- Add aiofiles>=24.1.0 dependency to pyproject.toml
- Export ReviewCyclePoller and ReviewCycleData from forge.observability
- Add 32 unit tests with full coverage

Closes: AISOS-2059
…ode switching

Implemented review cycle recording functionality for observability:

- Created src/forge/observability/review_recorder.py with:
  - ReviewCycleData dataclass with fields: cycle, max_cycles, verdict, feedback,
    skill, elapsed_seconds, timestamp (datetime type)
  - to_dict() and to_json() methods for serialization
  - ReviewCycleRecorder class with step_name parameter and mode switching:
    - mode='log': logs cycle data at INFO level with structured format
    - mode='copy': copies files to {recording_dir}/{step-name}/review_cycle_*.json
    - mode=None: recording disabled
  - step_dir property for step-specific subdirectory path
  - record() method for logging, record_file() method for copying

- Added AUTO_REVIEW_RECORD_POLLED_FILES setting to forge.config.Settings
  - Type: Literal['log', 'copy'] | None (default: None)
  - Placed under Auto Review Configuration section

- Updated observability module __init__.py to export new classes

- Created comprehensive unit tests (34 tests) in
  tests/unit/observability/test_review_recorder.py covering:
  - ReviewCycleData fields, to_dict(), to_json()
  - ReviewCycleRecorder initialization with all mode combinations
  - Log mode INFO level logging with structured format
  - Copy mode file copying and directory creation
  - Error handling for missing files and directory creation failures
  - Settings configuration validation

All tests pass.

Closes: AISOS-2060
Extend src/forge/api/routes/metrics.py with Prometheus metrics for review cycle monitoring:

- forge_review_cycles_total Counter: Total review cycles detected, labeled by skill and step
- forge_review_verdicts_total Counter: Review verdicts by outcome (approved/rejected), labeled by skill, step, and verdict
- forge_review_duration_seconds Histogram: Review cycle duration, labeled by skill and step, using same buckets as AGENT_DURATION [1, 5, 10, 30, 60, 120, 300, 600]

Added helper functions for recording metrics:
- record_review_cycle(skill, step)
- record_review_verdict(skill, step, verdict)
- observe_review_duration(skill, step, duration)

Added 11 unit tests in TestReviewCycleMetrics class verifying:
- Counter and histogram existence and labeling
- Helper function increment/observe behavior
- Metrics visibility in /metrics endpoint output

Closes: AISOS-2061
Create src/forge/observability/review_notifier.py module for posting
Jira comments when review cycles are detected during container execution.

Changes:
- Add ReviewJiraNotifier class with rate limiting support (default 30s)
- Format ReviewCycleData as readable Jira comments with:
  - Cycle number and max cycles with header
  - Verdict with ✅/❌ icons (APPROVED/REJECTED)
  - Skill name and duration (seconds or minutes)
  - Feedback summary (truncated at 500 chars with word boundary)
- Handle Jira API errors gracefully: log WARNING and continue
- Add NotifyResult dataclass for notify operation results
- Export ReviewJiraNotifier and NotifyResult from forge.observability
- Add 31 comprehensive unit tests covering:
  - Initialization and defaults
  - Rate limiting behavior
  - Comment formatting
  - Error handling
  - Integration scenarios

Follows existing JiraClient patterns from forge.workflow.utils.jira_status

Closes: AISOS-2062
…name parameter

Implementation:
- Extended ContainerResult dataclass with review_cycles field (list[ReviewCycleData])
- Added optional step_name parameter to ContainerRunner.run() method
- Added _poll_review_cycles() background task that:
  - Polls workspace for .forge/{step-name}/review_cycle_*.json files
  - Records cycles via ReviewCycleRecorder (log mode)
  - Emits Prometheus metrics for observability
- Added _poller_to_recorder_cycle() helper for ReviewCycleData conversion
- Background polling task cancelled cleanly on container exit
- Final poll performed after container exits to catch remaining files
- Collected review_cycles included in all ContainerResult returns

Unit tests:
- 14 tests in tests/unit/sandbox/test_runner_review_polling.py covering:
  - ContainerResult review_cycles field
  - Poller to recorder cycle conversion
  - Step name parameter acceptance
  - Background polling task lifecycle
  - Review cycle collection and metrics recording
  - Step name path organization

Closes: AISOS-2063
Added _sweep_review_cycles() private method to ContainerRunner that performs
a synchronous post-execution sweep after the container exits. This catches any
review cycle files that may have been missed during async polling, especially
when containers exit quickly after writing files.

Implementation details:
- Scans .forge/{step-name}/review_cycle_*.json for all files
- Deduplicates against files already processed by async poller
- Parses and validates JSON, adding to review_cycles list
- Records via recorder and emits Prometheus metrics
- Logs warning when files are missed during async polling
- Handles edge cases: invalid JSON, missing fields, empty files

Added 11 unit tests covering:
- Basic sweep functionality and deduplication
- Edge case handling (missing dir, invalid JSON, etc.)
- Integration tests for fast-exit scenario
- Warning logging verification

Closes: AISOS-2064
Updated all workflow nodes that invoke ContainerRunner to pass the
step_name parameter for proper review cycle file organization:

- implementation.py: step_name='implement_task'
- local_reviewer.py: step_name='local_review' (bug and feature reviews)
- ci_evaluator.py: step_name='analyze_ci' and step_name='fix_ci'
- code_review.py: step_name='code_review'
- docs_updater.py: step_name='update_docs'
- implement_review.py: step_name='implement_review_analyze' and 'implement_review_fix'
- plan_bug_fix.py: step_name='plan_bug_fix'
- rca_analysis.py: step_name='analyze_bug' and step_name='reflect_rca'
- rebase.py: step_name='rebase'

This enables the orchestrator to poll for and record files at
.forge/{step-name}/review_cycle_*.json during container execution.

Added unit tests for step_name propagation in both implementation.py
and local_reviewer.py. All 630+ unit tests pass.

Closes: AISOS-2065
…fication

Created docs/reference/review-md-schema.md with complete schema specification:
- Overview explaining purpose of review.md files
- File location and naming convention (review.md alongside SKILL.md)
- Per-project override precedence rules (project > default)
- YAML frontmatter schema with max_retries field (optional, default: 3)
- Priority chain: frontmatter > AUTO_REVIEW_MAX_RETRIES env var > default
- Prose body format for reviewer instructions with best practices
- Edge cases: empty file, frontmatter-only, missing frontmatter, malformed YAML
- Review cycle output path: .forge/{step-name}/review_cycle_N.json
- JSON schema for review cycle output files
- Validation summary table

Follows existing documentation style from docs/reference/ files and
is consistent with SKILL.md frontmatter patterns.

Closes: AISOS-2066
Add default review instructions for code implementation tasks with:
- YAML frontmatter with max_retries: 3
- Four review criteria: test coverage, error handling, documentation, no debug code
- Language-agnostic examples (Python, Go, Node.js, Rust)
- Instructions to use git diff origin/main...HEAD to see changes
- APPROVED/REJECTED verdict markers
- Structured feedback format for rejections with file/line/criterion

Closes: AISOS-2067
Add default review instructions for pre-PR code review quality gate. The
review.md complements the existing SKILL.md by adding an automated quality
gate that runs after the local-code-review skill completes.

Key features:
- YAML frontmatter with max_retries: 2 (lower than implement-task's 3 since
  this is a secondary review pass)
- Instructions to detect and run project test command (pyproject.toml,
  package.json, Makefile, Cargo.toml, go.mod)
- Four review criteria:
  1. Breaking changes: API/interface changes without migration
  2. Test failures: all tests must pass locally
  3. Security issues: hardcoded secrets, SQL injection, path traversal
  4. Spec alignment: changes match task description
- APPROVED/REJECTED verdict format with structured feedback examples

Closes: AISOS-2068
…plement-task

Created skills/aisos/implement-task/review.md as a reference example demonstrating
per-project skill customization. The override:

- Sets max_retries: 2 (customized from default of 3)
- Demonstrates inheritance by referencing default review criteria
- Adds Forge-specific criteria:
  - Testing with pytest (fixtures, parametrize, pytest-asyncio)
  - Type hints required on all public functions (PEP 604 style)
  - Logging via logging module (no print() for diagnostics)
  - Proper async/await patterns (gather vs sequential, no blocking I/O)
- Includes documentation for skill authors on how to create overrides
- Shows feedback format examples with Forge-specific file paths

This serves as documentation by example for teams creating their own
project-specific skill overrides.

Closes: AISOS-2069
Detailed description:
- Created docs/guide/auto-review.md documenting the auto-review primitive
- Covers all required sections:
  - Overview: What auto-review is and when to use it
  - Quick Start: Minimal review.md example
  - Configuration: max_retries in frontmatter, AUTO_REVIEW_MAX_RETRIES env var, default (3)
  - Writing Review Instructions: How to write effective reviewer prompts
  - Verdict Protocol: APPROVED/REJECTED markers with examples and feedback format
  - Per-Project Overrides: How skill resolution works for review.md
  - Observability: Review cycle files at .forge/{step-name}/review_cycle_N.json
  - Troubleshooting: Timeout, parsing failures, endless loops, and more
- Includes mermaid flowchart showing review loop
- Links to related docs (skills authoring, review-md-schema)
- Follows documentation style from existing docs/guide/ files

Closes: AISOS-2070
Updated three documentation files to reference the new auto-review capability:

- docs/skills/authoring.md: Added Auto-Review section explaining that skills
  can include review.md for automatic quality gates, with link to the guide
- docs/skills/index.md: Updated directory layout to show review.md alongside
  SKILL.md, added Optional Files table with review.md description
- docs/skills/defaults.md: Added Auto-review notes under implement-task and
  local-code-review sections describing what their review.md files check

All changes link to docs/guide/auto-review.md for full details. Changes are
minimal and focused — the full documentation is in the auto-review guide.

Closes: AISOS-2071
Fixed ruff I001 lint error in src/forge/observability/__init__.py:
- Changed multi-line import blocks to single-line imports
- Imports now properly sorted alphabetically by module name

All tests pass, no breaking issues found in code review.

Closes: AISOS-741-review
…ig reference

Added new Auto-Review section to docs/reference/config.md documenting:
- AUTO_REVIEW_MAX_RETRIES: default max retry attempts
- AUTO_REVIEW_POLL_INTERVAL: polling interval for review cycle files
- AUTO_REVIEW_RECORD_POLLED_FILES: recording mode for observability

These settings were added to src/forge/config.py as part of the auto-review
feature but were not reflected in the configuration reference documentation.

Closes: AISOS-741-docs
Comment thread containers/entrypoint.py
JoshSalomon and others added 6 commits July 2, 2026 13:24
The entrypoint imports from review.py but the Containerfile only
copied entrypoint.py. This caused ImportError at container startup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The test_review.py tests import from the containers/ directory which
isn't a Python package. Added sys.path setup in the test conftest
following the existing pattern used by integration tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
poll() was declared async but only did 'return self'. As an async
def, this wraps self in a coroutine, breaking 'async for' on
Python 3.12 (CI) which requires __aiter__ on the direct return
value. Making it a regular method returns self directly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
poll() is correctly async (returns a coroutine wrapping self). The
caller in runner.py must await it to get the async iterator. The
previous fix incorrectly made poll() sync, which broke the poller
tests on Python 3.12.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The poll() method is async, so the mock must be AsyncMock for
await to work on Python 3.12. MagicMock returns a non-awaitable
which fails with "can't be used in 'await' expression".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
poll() only sets a flag and returns self — no I/O, no reason to be
async. As async def, it wraps self in a coroutine which breaks
'async for' on Python 3.11 (CI) where coroutines don't satisfy
__aiter__. Making it sync aligns with how runner.py calls it:
'async for new_cycles in poller.poll():'

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant